Code Similarities Beyond Copy & Paste

Authors

  • Elmar Jürgens
  • Florian Deißenböck
  • Benjamin Hummel
Abstract

Redundant source code hinders software maintenance, since updates have to be performed in multiple places. This holds regardless of whether the redundancy was created by copy&paste or by independent development of behaviorally similar code. Existing clone detection tools successfully discover syntactically similar redundant code. They thus work well for redundancy that has been created by copy&paste. But how syntactically similar is behaviorally similar code of independent origin? This paper presents the results of a controlled experiment which demonstrates that behaviorally similar code of independent origin is highly unlikely to be syntactically similar. In fact, it is so syntactically different that existing clone detection approaches cannot identify more than 1% of such redundancy. This is unfortunate, as manual inspections of open source software indicate that behaviorally similar code of independent origin does exist in practice and does present problems to maintenance.

1. Similarity ≠ Similarity

Research in software maintenance has shown that many programs contain a significant amount of duplicated (cloned) code. Such cloned code is considered harmful for two reasons: (1) multiple, possibly unnecessary, duplicates of code increase maintenance costs [16, 24], and (2) inconsistent changes to cloned code can create faults and, hence, lead to incorrect program behavior [13]. Obviously, the negative impact of clones on software maintenance is not due to copy&paste itself but is caused by the semantic coupling of the clones. Hence, behaviorally similar code, independent of its origin, suffers from the same problems clones are known for. In fact, the re-creation of existing functionality can be seen as even more critical, since it represents a missed reuse opportunity.

The research community has developed a number of successful approaches to detect and manage code duplication. However, the capabilities of existing approaches are not fully clarified yet.
While most previous work agrees that the existing approaches are, indeed, limited to detecting copy&pasted (and potentially modified) code, it is sometimes conjectured that they can also find code that is behaviorally similar but has been developed independently. One reason for this uncertainty is that we do not really know how structurally different independently developed code with similar behavior actually is. As a result, it is currently not well understood to what extent real-world programs contain redundancy that cannot be attributed to copy&paste, although intuition tells us that large projects are likely to contain multiple implementations of the same functionality.

To develop a better understanding of redundancy beyond copy&paste, this paper presents the results of an experiment that investigated how well existing clone detection approaches detect similarity in 109 independently developed variations of the same functionality. Strikingly, existing clone detection approaches did not achieve a recall of more than 1% in this experiment, although they were run with a very unrestrictive configuration that would yield far too many false positives in practice. Furthermore, we used manual reviews of an open source system to determine whether behaviorally similar code that does not result from copy&paste occurs in real-world software. This investigation provides a strong indication that this type of redundancy occurs and, indeed, appears to be problematic for software maintenance.

Research Problem

While clone detection is a proven approach to detect copy&pasted code, it is unclear to what extent clone detection can be used to detect code that is behaviorally similar but not the result of copy&paste. Consequently, we currently do not know the recall of clone detection approaches with respect to similar code in general, i.e., not limited to copy&paste.
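Recall in this sense can be read as the share of known behaviorally similar code pairs that a detector actually reports. The following is a minimal sketch of that ratio only; the class name, the method name, and the string keys used to identify pairs (such as "A-B") are hypothetical, and the excerpt does not describe how the experiment itself computed recall.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the recall ratio; not the paper's evaluation code.
final class RecallSketch {

    // Fraction of the known-similar pairs that also appear among the reported pairs.
    static double recall(Set<String> knownSimilar, Set<String> reported) {
        if (knownSimilar.isEmpty()) {
            throw new IllegalArgumentException("knownSimilar must not be empty");
        }
        Set<String> found = new HashSet<>(reported);
        found.retainAll(knownSimilar);   // keep only correctly reported pairs
        return (double) found.size() / knownSimilar.size();
    }
}
```

Reported pairs that are not in the known-similar set (false positives) do not affect recall, which is why an unrestrictive detector configuration can only help this number.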
Contribution

We extend the existing empirical knowledge with an experiment that demonstrates that behaviorally similar code of independent origin is unlikely to be representationally similar. With this, we show that existing clone detection approaches are ill-suited to detect code that is behaviorally similar but has been developed independently. We illustrate the relevance of this shortcoming with a case study in which we used manual reviews to identify behaviorally similar code in an open source system.

2. Notions of Similarity

In this section, we differentiate between representational and behavioral similarity of code. To the best of our knowledge, neither clone detection nor other research areas concerned with program equivalence (including program schemas [8], refactoring [21] and model checking [4]) provide suitable definitions that can serve as a crisp separation criterion between the two. Hence, we fall back on a more informal but also more intuitive description here.

2.1. Program-Representation-based Similarity

Numerous clone detection approaches have been suggested [16, 24]. All of them statically search a suitable program representation for similar parts. Among other things, they differ in the program representation they work on and the search algorithms they employ. Consequently, each approach has a different notion of similarity between the code fragments it can detect as clones. We classify them by the type of behavior-invariant variation they can compensate for when recognizing equivalent code fragments and by the differences they tolerate between similar code fragments.

Text-based approaches detect clones that are equal on the character level. Token-based approaches can perform token-based filtering and normalization. They are thus robust against reformatting, documentation changes or renaming of variables, classes or methods.
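The robustness of token-based approaches against reformatting, comment changes and renaming can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any existing tool: the class `TokenNormalizer`, the placeholder tokens `ID` and `LIT`, the regular expression used as a lexer, and the small keyword set are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of token filtering and normalization.
final class TokenNormalizer {

    // Deliberately tiny keyword subset (assumption); a real detector uses a full lexer.
    private static final Set<String> KEYWORDS =
            Set.of("int", "while", "if", "else", "for", "return");

    // Alternatives in order: line comment, block comment, identifier, number,
    // any other single non-whitespace character (operators, punctuation).
    private static final Pattern TOKEN = Pattern.compile(
            "//[^\\n]*|/\\*.*?\\*/|[A-Za-z_]\\w*|\\d+|\\S", Pattern.DOTALL);

    static List<String> normalize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("//") || t.startsWith("/*")) {
                continue;                                     // filtering: drop comments
            }
            if (t.matches("[A-Za-z_]\\w*")) {
                tokens.add(KEYWORDS.contains(t) ? t : "ID");  // normalize names
            } else if (t.matches("\\d+")) {
                tokens.add("LIT");                            // normalize number literals
            } else {
                tokens.add(t);                                // keep operators and punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String original = "int total = 0; // running sum\nwhile (n > 0) { total += n; n -= 1; }";
        String renamed = "int acc=0;\nwhile(k>0){acc+=k;k-=1;} /* reformatted and renamed */";
        System.out.println(normalize(original).equals(normalize(renamed)));
    }
}
```

The two snippets in `main` differ in layout, comments and variable names, yet they normalize to the same token sequence, so a detector that compares normalized sequences would report them as clones.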
AST-based approaches can perform grammar-level normalization and are thus furthermore robust against differences in optional keywords or parentheses. PDG-based approaches are somewhat independent of statement order and are thus robust against reordering of commutative statements. In a nutshell, existing approaches exhibit varying degrees of robustness against changes to duplicated code that do not change its behavior.

Some approaches also tolerate differences between code fragments that change behavior. Most approaches employ some normalization that removes or replaces special tokens and can make code that exhibits different behavior look equivalent to the detection algorithm. Moreover, several approaches compute characteristic vectors for code fragments and use a distance threshold between vectors to identify clones. Depending on the approach, characteristic vectors are computed from metrics [15, 19] or AST fragments [3, 9]. Furthermore, ConQAT [13] detects code fragments that differ up to an absolute or relative edit distance as clones.

In summary, the notions of representational similarity employed by state-of-the-art clone detection approaches differ in the types of behavior-invariant changes they can compensate for and in the amount of further deviation they allow between code fragments. The amount of deviation that can be tolerated in practice is, however, severely limited by the number of false positives it produces.

    int x, y, z;
    z = x * y;
    z = 0;
    while (x > 0) { z += y; x -= 1; }
    while (x < 0) { z -= y;
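The code fragment at the end of the excerpt is cut off, but it appears to contrast multiplication via the `*` operator with multiplication via repeated addition. The sketch below illustrates that contrast under assumptions: the method names are hypothetical, the closing `x += 1; }` of the second loop is an assumed completion of the truncated fragment, and comparing outputs on sample inputs is used here only as an informal way to show behavioral agreement, not as the method of the paper.

```java
// Hypothetical illustration: two representationally different fragments with the same behavior.
final class BehavioralSimilarity {

    // Variant 1: multiplication with the built-in operator.
    static int multiplyDirect(int x, int y) {
        return x * y;
    }

    // Variant 2: multiplication by repeated addition or subtraction.
    static int multiplyByAddition(int x, int y) {
        int z = 0;
        while (x > 0) { z += y; x -= 1; }
        while (x < 0) { z -= y; x += 1; }   // "x += 1" is an assumed completion
        return z;
    }

    // Compares both variants on every pair of inputs in [-bound, bound].
    static boolean agreeOnSamples(int bound) {
        for (int x = -bound; x <= bound; x++) {
            for (int y = -bound; y <= bound; y++) {
                if (multiplyDirect(x, y) != multiplyByAddition(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnSamples(20));
    }
}
```

Both variants return the same result for every sampled input, yet one is a single expression and the other consists of two loops, so approaches that compare text, tokens or syntax trees have little common structure to match.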


Similar resources

Evaluation of Duplicated Code Detection Tools in Cross-Project Context

Two or more code segments are considered duplicated when there is a high rate of similarity among them or they are exactly the same. Aiming to detect duplicated code in single software projects, several tools have been proposed. However, in case of cross-project detection, there are few tools. There is little empirical knowledge about the efficacy of these tools to detect duplicated code across...


How are functionally similar code clones syntactically different? An empirical study and a benchmark

Background. Today, redundancy in source code, so-called “clones” caused by copy&paste, can be found reliably using clone detection tools. Redundancy can arise also independently, however, not caused by copy&paste. At present, it is not clear how only functionally similar clones (FSC) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactical differences ...


CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code

Copy-pasted code is very common in large software because programmers prefer reusing code via copy-paste in order to reduce programming effort. Recent studies show that copy-paste is prone to introducing bugs and a significant portion of operating system bugs concentrate in copy-pasted code. Unfortunately, it is challenging to efficiently identify copy-pasted code in large software. Existing co...


Clone Detection Beyond Copy&Paste

We argue three positions: 1) independently developed semantically similar code is unlikely to be representationally similar, 2) existing clone detection approaches are ill-suited for detecting such similarities and 3) dynamic clone detection is a promising approach to detect semantically similar yet representationally different code. Numerous clone detection approaches have been proposed [4]...


Ethnographic Study of Copy and Paste Programming Practices in OOPL

When programmers develop and evolve software, they frequently copy and paste (C&P) code from an existing code base, or sources such as web pages or documentation. We believe that programmers follow a small number of well defined C&P usage patterns when they program, and understanding these patterns would enable us to design tools to improve the quality of software. We conducted an ethnographic ...

Publication date: 2010